Suicide Understanding and Prevention
This notebook contains the analysis of a suicide dataset, which was the final assignment of the Statistical data analysis course.
I begin by inspecting whether the dataset contains any NA values that would need to be taken care of.
data %>% is.na() %>% colSums()
## country year sex age suicides_no
## 0 0 0 0 0
## population gdp_for_year gdp_per_capita generation continent
## 0 0 0 0 0
It seems that the dataset has already been conveniently cleaned beforehand.
Did all of the included countries contribute the same amount of data? Are there any countries that reported their statistics far less often than other countries?
years.count <- data %>% group_by(continent, country) %>%
summarise(count = n_distinct(year)) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
years.count %>%
ggplot(aes(x = count, fill = continent)) +
scale_x_continuous(n.breaks = 31) +
geom_bar()
It looks that a significant number of countries provided data about the number suicides within their population most of the years during 1985-2015. However there are some countries that supplied only less than 10 year worth of data.
Throughout the majority of this notebook, I will be working with suicide rates rather than with concrete numbers of total suicides. The reason for this is that it allows us to perform better comparisons of suicide prevalence between different countries and continents, as the populations of different areas differ, and the total number of suicides in a population is correlated with the size of the population.
country.summary <- data %>% group_by(year) %>%
summarise(population = sum(population),
suicides_no = sum(suicides_no))
cor(x = country.summary$population, y = country.summary$suicides_no)
## [1] 0.8891355
If we wanted to be even more rigorous with our analysis, we could standardize our suicide rates by age distribution, as is done by the World Health Organization. This allows for better comparison of statistics between countries with different population age profiles.
The first thing that we can inspect is the progression of suicide rates worldwide during the time period 1985-2015.
global.suicide.rate <- data %>% group_by(year) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000
)
global.suicide.rate %>%
ggplot(aes(x = year, y = suicides_per_100k)) +
geom_point(color="orange", size=2.4) +
geom_line(color="orange", size=1.1) +
labs(
title = "Worldwide Suicide Rates",
x = "Year", y = "Suicides per 100k people"
) +
scale_x_continuous(breaks = seq(1985, 2015, 2))
This figure shows that there has been a great rise in suicides throughout the years 1989 and 1995. However, with a few exceptions, the suicide rate seems to be on decline ever since then.
differences <- data.frame(
difference = tail(global.suicide.rate$suicides_per_100k, -1) - head(global.suicide.rate$suicides_per_100k, -1),
year = tail(global.suicide.rate$year, -1))
differences <- differences %>%
mutate(changeNotPositive = difference <= 0)
differences %>%
ggplot(aes(x = year, y = difference, color = changeNotPositive)) +
geom_segment(aes(x = year, xend = year, y = 0, yend = difference)) +
geom_point(size = 2) +
scale_x_continuous(breaks = seq(1986, 2015, 2)) +
scale_y_continuous(breaks = seq(-1, 1.5, 0.5)) +
theme(legend.position = "none") +
labs(title = "Change in suicide rate in time",
subtitle = "Shows the change in suicide rate from the previous year",
x = "Year", y = "Difference")
The figure above shows us the differences in suicide rates between two subsequent years.
Rather than inspecting the data worldwide, we can take a look at the progression of suicide rates within each of the continents and check wheter there appears to be any differences between them.
continent_overall_plot <- data %>% group_by(continent) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = continent, y = suicides_per_100k, fill = continent)) +
geom_bar(stat = "identity") +
labs(title = "Overall differences in suicide rates between continents",
x = "Continent", y = "Suicides per 100k people") +
theme(legend.position = "none")
continent_by_year_plot <- data %>% group_by(year, continent) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = year, y = suicides_per_100k, color=continent)) +
geom_line() +
geom_point(size=0.8) +
scale_x_continuous(breaks = seq(1985, 2015, 5)) +
facet_wrap(~ continent, ncol = 1, scales = "free_y") +
labs(title = "Suicide Rate By Continent In Time",
x = "Year", y = "Suicides per 100k people") +
theme(legend.position = "none")
grid.arrange(continent_overall_plot, continent_by_year_plot, ncol = 2)
It turns out that there certainly is a difference in suicide rates between the continents. It is worth noting that the data for Oceania and the Americas suggest that the suicide rates have been on rise in the most recent years, which means that these continents do not comply with the worldwide trend.
The reason that the worldwide trend suggests that suicide rates are decreasing is the decrease of suicide rates within European countries. The suicide rates in Europe are large enough to dominate the suicide rates within the other continents. Because of that, they also dictate the worldwide trend. Europe also has the highest overall suicide rate during the inspected time period.
I cannot recognize any trend within the Asian countries.
The data from African countries may be incomplete. There is a very pronounced and sudden decline in suicide rates after the year 1995. I do not think that this data represents the real situation within the African countries, and suspect that the quality of data collection and suicide reports might be the culprit here.
Let’s inspect the European countries in more detail, since Europe has the highest suicide rate in the world.
data_europe <- data %>% filter(continent == "Europe") %>%
group_by(country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
arrange(desc(suicides_per_100k))
data_europe %>%
ggplot(aes(x = reorder(country, suicides_per_100k), y = suicides_per_100k, fill = suicides_per_100k)) +
geom_bar(stat = "identity") +
scale_fill_viridis_c() +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Overall Suicide Rates in European countries",
x = "Country", y = "Suicides per 100k people")
The country with the largest suicide rate is Lithuania, followed by the Russian Federation and Belarus. It seems that a large number of the countries with high suicide rate seem to be East European.
data_europe %>%
ggplot(aes(x = reorder(country, suicides_no), y = suicides_no, fill = suicides_no)) +
geom_bar(stat = "identity") +
scale_fill_viridis_c() +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Number of suicides in European countries",
x = "Country", y = "Number of suicides")
data %>% filter(country == "Russian Federation") %>%
select(suicides_no) %>%
sum()
## [1] 1209742
data %>% filter(country != "Russian Federation") %>%
select(suicides_no) %>%
sum()
## [1] 5523075
Inspecting the concrete number of suicides in the European countries reveals that the Russian Federation is the country with the most suicides, and that the number of suicides is much greater than in the other countries. This can be explained by Russia being the European country with the largest population and also having the second highest overall suicide rate in Europe.
A significant factor in suicide risk can be the sex of the person.
data %>% group_by(year, sex) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
ggplot(aes(x = year, y = suicides_per_100k, color=sex)) +
geom_line(size=1.1) +
geom_point(size=2.4) +
labs(title = "Global Differences in Suicide Rates Between Sexes",
x = "Year", y = "Suicides per 100k people")
It is clear that the male part of the population is at a higher risk of suicide than its female counterpart. Another somewhat interesting observation is that the change in suicide rate of women in time is not as pronounced as is the change of suicide rate of men.
data %>% group_by(continent, sex) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = continent, y = suicides_per_100k, fill = sex)) +
geom_bar(stat = "identity", position = "stack") +
scale_y_continuous(breaks = seq(0, 40, 5)) +
coord_flip() +
theme(legend.position = "bottom") +
labs(title = "Overall differences in suicide rates between the continents",
subtitle = "along with differences between sexes within individual continents",
x = "Continent", y = "Suicides per 100k people")
We can also inspect the differences between the sexes on within individual continents. This again reveals the same pattern for all of the continents, that is that men are at higher risk of suicide than women.
data %>% group_by(continent, sex, year) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = year, y = suicides_per_100k, color = sex)) +
geom_line() +
geom_point() +
facet_wrap(~ continent, ncol = 1, scales = "free_y", strip.position = "right") +
scale_x_continuous(breaks = seq(1985, 2015, 2)) +
theme(legend.position = "bottom")
By inspecting the progression of suicide rates of sexes within different continents, we can state that the suicide rates of women are consistently lower than those of men across all of the inspected continents.
Another factor that might play a role in the suicide risk is the age of the person.
data %>% group_by(year, age) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
ggplot(aes(x = year, y = suicides_per_100k, color = age)) +
geom_line() +
geom_point(size = 0.9) +
scale_x_continuous(breaks = seq(1985, 2015, 2)) +
scale_color_discrete() +
labs(title = "Relationship between suicide rate and age",
x = "Year", y = "Suicides per 100k people")
It seems that if we inspect the suicide data globally, there in fact is a difference is suicide rates between different age groups. More over, the risk of suicide seems to grow larger with the age of the person.
data %>% group_by(continent, year, age) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
ggplot(aes(x = year, y = suicides_per_100k, color = age)) +
geom_line() +
geom_point(size = 0.9) +
scale_x_continuous(breaks = seq(1985, 2015, 2)) +
scale_color_discrete() +
facet_wrap(~continent, ncol = 1) +
labs(title = "Relationship between suicide rate and age within individual continents",
x = "Year", y = "Suicides per 100k people")
However, if we inspect the dependence on suicide rate on age within individual continents, we can see that the previous observation does not hold true everywhere in the world.
The differences between age groups are most pronounced in the Asian and European countries. There is also some difference between the age groups in the Americas, though they do not appear as pronounced as in the two previously mentioned continents.
On the other hand, if we take a look at the data from Oceania, we can see that there does not appear to be any striking dependence of the suicide rate on the age group.
Using our dataset, we can also try to find out whether the gross domestic product of a country affects its suicide rate.
data %>% group_by(continent, country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
mean_gdp_per_capita = mean(gdp_per_capita),
.groups = "keep") %>%
ggplot(aes(x = mean_gdp_per_capita, y = suicides_per_100k, color = continent)) +
geom_point() +
labs(title = "Relationship between suicide rate and GDP per capita",
subtitle = "Both variables are represented as means from the time period 1985-2015",
x = "GDP per capita [$]", y = "Suicides per 100k people")
The first attempt to investigate the relationship between these two variables can be plotting the overall suicide rate of a country against its mean GDP per capita during the available time period. However, this figure does not seem to reveal any obvious relationship between these variables.
We can also try to calculate the correlations between these two variables over the same time interval for each of the countries individually. The correlations were computed using Spearman’s rho, since the relationship between the two variables is not expected to be linear. The correlations have been computed only for the countries where 10 or more years of data has been available.
correlation.gdp.suicide <- data.frame()
for (c in levels(data$country)) {
country.data <- data %>% filter(country == c) %>%
group_by(country, year, gdp_per_capita) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep")
if (nrow(country.data) < 10) {
next
}
corr <- cor.test(x = country.data$gdp_per_capita, y = country.data$suicides_per_100k, method = "spearman",
exact = FALSE)
correlation.gdp.suicide <- rbind(
correlation.gdp.suicide,
list(
country = c,
correlation = corr$estimate,
p.value = corr$p.value,
significant = corr$p.value < 0.05
)
)
}
correlation.gdp.suicide <- arrange(correlation.gdp.suicide, correlation.gdp.suicide$correlation)
head(correlation.gdp.suicide)
## country correlation p.value significant
## 1 Hungary -0.9176923 1.071471e-10 TRUE
## 2 Latvia -0.9155844 5.944568e-09 TRUE
## 3 Austria -0.9084677 1.653957e-12 TRUE
## 4 Estonia -0.9077922 1.333167e-08 TRUE
## 5 France -0.9003337 1.259076e-11 TRUE
## 6 Norway -0.8972191 1.899150e-11 TRUE
tail(correlation.gdp.suicide)
## country correlation p.value significant
## 85 Chile 0.7669355 4.852932e-07 TRUE
## 86 Montenegro 0.7852071 7.120284e-03 TRUE
## 87 Philippines 0.8178571 1.951365e-04 TRUE
## 88 Brazil 0.8532258 1.077011e-09 TRUE
## 89 Republic of Korea 0.8955645 1.028043e-11 TRUE
## 90 Mexico 0.9475806 6.593491e-16 TRUE
We can see that this provided us with a much more better insight into the data, and that there truly is a relationship between the suicide rate in a country and the gross domestic product of that country. This relationship seems to be quite strong for a considerable amount of countries. Some of the countries achieve a large positive correlation and some achieve a large negative correlation. However, they are not distributed equally, as negative correlations are present more often than the positive ones. All the correlations (with the exception of Cyprus) whose absolute value is larger than \(0.4\) are statistically significant with confidence level \(\alpha = 0.05\).
correlation.gdp.suicide %>%
filter(p.value < 0.05) %>%
mutate(isNotPositive = correlation <= 0) %>%
ggplot(aes(x = correlation, fill = isNotPositive)) +
geom_histogram(binwidth = 0.1, center = 0.1, color = "#e9ecef") +
scale_x_continuous(breaks = seq(-1, 1, 0.1)) +
scale_y_continuous(breaks = seq(0, 12, 2)) +
labs(title = "Histogram of statistically significant correlations of suicide rate and GDP") +
theme(legend.position = "none")
Let’s inspect this relationship further by plotting the suicide rate against GDP for six of the countries that achieved the largest negative correlations.
top.corr.countries <- head(correlation.gdp.suicide)$country
top.corr.correlations <- head(correlation.gdp.suicide)$correlation
my_labeller <- function(countries) {
correlation <- map(
countries,
function(x) correlation.gdp.suicide[correlation.gdp.suicide$country == x, ]$correlation
) %>% unlist
correlation <- round(correlation, digits = 2)
result <- paste(as.character(countries), ", rho = ", correlation, sep = "")
}
data %>% filter(country %in% top.corr.countries) %>%
group_by(country, year, gdp_per_capita) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
mutate(country_order = factor(country, levels = top.corr.countries)) %>%
ggplot(aes(x = gdp_per_capita, y = suicides_per_100k, color = country_order)) +
geom_point() +
scale_colour_viridis_d(end = 0.8) +
facet_wrap(~ country_order, scales = "free",
labeller = as_labeller(my_labeller)) +
labs(title = "Countries with the largest negative correlation of gross domestic product and suicide rate",
subtitle = "Correlation is reported using Spearman's rho",
x = "GDP [$]", y = "Suicides per 100k people") +
theme(legend.position = "none")
We can see that plotting the data allows us to easily notice the relationship between these two variables. It is also worth noting that all of these countries achieving the largest negative correlations are European.
We can do the same for countries that achieved the largest positive correlation.
top.corr.countries <- head(arrange(correlation.gdp.suicide, desc(correlation.gdp.suicide$correlation)))
top.corr.countries <- top.corr.countries$country
my_labeller <- function(countries) {
correlation <- map(
countries,
function(x) correlation.gdp.suicide[correlation.gdp.suicide$country == x, ]$correlation
) %>% unlist
correlation <- round(correlation, digits = 2)
result <- paste(as.character(countries), ", rho = ", correlation, sep = "")
}
data %>% filter(country %in% top.corr.countries) %>%
group_by(country, year, gdp_per_capita) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
mutate(country_order = factor(country, levels = top.corr.countries)) %>%
ggplot(aes(x = gdp_per_capita, y = suicides_per_100k, color = country_order)) +
geom_point() +
scale_colour_viridis_d(begin = 0.8, end = 0) +
facet_wrap(~ country_order, scales = "free",
labeller = as_labeller(my_labeller)) +
labs(title = "Countries with the largest correlation of gross domestic product and suicide rate",
subtitle = "Correlation is reported using Spearman's rho",
x = "GDP [$]", y = "Suicides per 100k people") +
theme(legend.position = "none")
Once again, the interaction between suicide rate and GDP is easily recognizable, although in some cases (e.g. Montenegro) it does not seem to be as well structured as it has been for the negative correlations.
Another thing I would like to investigate is whether there is a connection between the overall suicide rate of a country and its dependence on the GDP of the country.
data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(continent, country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita),
.groups = "keep") %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
# filter(p.value < 0.05) %>%
ggplot(aes(x = correlation, y = suicides_per_100k, color = continent, shape = significant)) +
geom_point() +
labs(title = "Relationship between suicide rate and its correlation with GDP",
subtitle = "Correlation is reported using Spearman's Rho",
x = "Correlation", y = "Suicides per 100k people")
The plot includes the correlations even if they were not statistically significant. It seems that the countries, where the suicide rate is highly correlated with the gross domestic product, tend to have higher suicide rates than the countries where the correlation with GDP is not very strong. This can be observed on both ends of the correlation’s value range, however it is far more pronounced for countries that achieve a negative correlation.
We can also observe that the suicide rates of countries that achieve a large negative correlation tend to be larger than of countries that achieve a large positive correlation.
Once again, it is worth noting that the majority of countries that achieve a negative correlation is European. Let’s see what correlations did individual European countries achieve.
countries.continent <- distinct(data[c("country", "continent")])
correlation.gdp.suicide %>%
inner_join(countries.continent, by = "country") %>%
filter(continent == "Europe") %>%
ggplot(aes(x = reorder(country, correlation), y = correlation, fill = correlation)) +
geom_bar(stat = "identity") +
scale_fill_viridis_c() +
scale_y_continuous(limits = c(-1, 1), breaks = seq(-1, 1, 0.1)) +
labs(title = "Correlations of European countries",
x = "Country", y = "Correlation") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
legend.position = "none")
We can see that most of the European countries truly achieve large negative correlations of suicide rates and GDP. However, there are a few exceptions, where these two variables are correlated positively.
The plot of an average suicide rate of a country against its correlation with the country’s GDP per capita was really interesting to me, and therefore I would like to try to separate the individual countries using K-means clustering.
First, let us select the data that will be used for clustering.
clustering.data <- data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita)) %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
mutate(suicides_per_100k = scale(suicides_per_100k),
correlation = scale(correlation)) %>%
dplyr::select("correlation", "suicides_per_100k")
We can utilize the elbow method in order to select the number of clusters.
set.seed(42)
k.max <- 15
wss <- sapply(1:k.max,
function(k) { kmeans(clustering.data, k, nstart=50,iter.max = 15 )$tot.withinss })
elbow <- data.frame(k = c(1:k.max), wss = wss)
elbow %>%
ggplot(aes(x = k, y = wss, color = "orange")) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(1, k.max, 1)) +
labs(title = "The change of the total within cluster sum of squares with number of clusters",
x = "Number of clusters", y = "Total within cluster sum of squares") +
theme(legend.position = "none")
We can see that the decline slows down after creating 4 clusters, and therefore we will set \(k = 4\).
clustering.data <- data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita)) %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
mutate(suicides_per_100k = scale(suicides_per_100k),
correlation = scale(correlation)) %>%
dplyr::select("correlation", "suicides_per_100k")
set.seed(1234)
clustering <- kmeans(clustering.data, 4)
fviz_cluster(clustering, data = clustering.data, labelsize = 0) +
labs(title = "Clustering countries based on their dependence on GDP",
subtitle = "Both features have been normalized",
x = "GDP and Suicide rate correlation",
y = "Suicides per 100k people")
It looks like that the first cluster is comprised of countries where the suicide rate has a strong negative correlation with the GDP and the suicide rate is above average. The second cluster is also made out of countries with strong correlation, however it is positive instead of negative. The suicide rate is also higher than average for most of the countries within this cluster.
The remaining two clusters are composed of countries whose suicide rates are mostly below the average.
The \(\texttt{generation}\) feature is redundant, since we can assign a generation to a group of population by combining the information from the features \(\texttt{year}\) and \(\texttt{age}\).
Furthermore, the features \(\texttt{gdp_for_year}\) and \(\texttt{gdp_per_capita}\) are mutually redundant. That is because we can always use one of these features in combination with the sum of total population in a country in a certain year to get the value of the other feature, e.g. \(\texttt{gdp_per_capita} = \frac{\texttt{gdp_for_year}}{\texttt{total_population}}\).
data %>% group_by(country, year, gdp_for_year, gdp_per_capita) %>%
summarise(population = sum(population), .groups = "keep") %>%
mutate(x = as.integer(round(gdp_for_year / population)),
y = abs(x - gdp_per_capita)) %>%
pull(y) %>%
mean
## [1] 0
The exploratory data analysis helped us uncover many interesting insights. First of all, we have seen that the world experienced a sudden rise in suicide rates during 1989-1995, however these rates these have been on decline since then. We also inspected the difference in suicide rates between the individual continents. This revealed that Europe has the highest suicide rate, followed by Asia. Inspection of the suicide rates among European countries has shown us that Lithuania has the highest suicide rate, however the number of suicides has been the largest in Russia.
We have also seen that men are at a higher risk of committing suicide than women, and that this is true across all of the continents and the investigated time period.
It also became apparent that the age plays a role in the risk of committing suicide. There seems to be a difference in suicide rates between the age groups, although this difference has not been seen in every continent.
An interesting relationship between the suicide rate within a country and the country’s gross domestic product has been revealed as well. This relationship appears to be quite substantial in many cases, and is almost always present within the European countries. There also appears to be a connection between the correlation of suicide rate and GDP and the suicide rate as such.
I would like to test several hypotheses based on the analysis that we have produced earlier. The confidence level for all of the hypothesis tests is set to be \(\alpha = 0.05\).
The first hypothesis that we will test is that that men tend to commit suicide more often than women. Therefore, we will test if the suicide rate of men was higher than that of women during the time interval 1985-2015.
Let \(\mu_m\) and \(\mu_w\) be the means of suicide rates of men and women, respectively.
The null hypothesis is then \[ \mu_m \leq \mu_w \] and the alternative hypothesis is \[ \mu_m > \mu_w \] where \(\mu_m\) and \(\mu_w\) are the means of suicide rates of men and women during the time period 1985-2015, respectively. The confidence level is set to be \(\alpha = 0.05\).
In order to test this hypothesis, we will make use of the Welch’s Two Sample t-test, which is a modification of Student’s t-test supporting unequal variances of the samples.
sex.hypothesis.data <- data %>% group_by(year, sex) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
)
rates.male <- sex.hypothesis.data %>% filter(sex == "Male") %>% pull(suicides_per_100k)
rates.female <- sex.hypothesis.data %>% filter(sex == "Female") %>% pull(suicides_per_100k)
t.test(
x = rates.male,
y = rates.female,
alternative = "greater"
)
##
## Welch Two Sample t-test
##
## data: rates.male and rates.female
## t = 34.928, df = 32.481, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 13.89987 Inf
## sample estimates:
## mean of x mean of y
## 20.599946 5.991947
We can see that the resulting p-value of the Welch’s Two Sample t-test is smaller than our confidence level, and because of that we can reject the null hypothesis in favor of the alternative hypothesis. We have confirmed our initial suspicion, that is that men have higher suicide rates than women.
Another interesting insight gained from the exploratory analysis was that there is an apparent difference in suicide rates between individual age groups.
Usually, we could use ANOVA to test if at least one of the groups differs significantly from the rest of the groups. Unfortunately, our data violates certain assumptions of this method, namely the normality assumption, which requires every group to be normally distributed.
We can easily show the violation of this assumption by inspecting one of the age groups, for example the group containing people older than 75 years. We can check whether it is normally distributed by taking a look at the histogram of differences of suicide rates throughout the years and their mean.
age.old.rates <- data %>%
filter(age == "75+") %>%
group_by(year) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000)
(data.frame(residual = age.old.rates$suicides_per_100k - mean(age.old.rates$suicides_per_100k))) %>%
ggplot(aes(x = residual)) +
scale_fill_viridis_c() +
geom_histogram(binwidth = 2,
center = 1,
fill = viridis::viridis(1, begin = 0.7),
color = "#e9ecef") +
labs(title = "Histogram of differences of yearly suicide rates and their mean",
subtitle = "for people that are 75+ years old",
x = "Residual", y = "Count")
This histogram indicates that the distribution of suicide rates in fact does not originate from the normal distribution.
Due to this assumption being violated, we are forced to use a different method, the Kruskal-Wallis test, which can be used even if the normality assumption does not hold. The Kruska-Wallis test is a non-parametric test, and as such it does not assume anything about the distribution of the data. It allows us to test a null hypothesis that is somewhat similar to that of ANOVA.
Let \(g_1, \dots, g_6\) be the data of different age groups from our data. The null hypothesis is that the data in groups \(g_1, \dots, g_6\) comes from the same population. The alternative hypothesis is that at least one of the groups comes from a different population.
age.hypothesis.data <- data %>%
group_by(year, age) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
)
kruskal.test(suicides_per_100k ~ age, data = age.hypothesis.data)
##
## Kruskal-Wallis rank sum test
##
## data: suicides_per_100k by age
## Kruskal-Wallis chi-squared = 172.34, df = 5, p-value < 2.2e-16
The p-value of the test is less than our confidence level, and so we reject the null hypothesis in favour of the alternative hypothesis. The conclusion is that there is indeed a difference between the individual age groups.
During the analysis we have seen that countries that achieve large negative correlation have a larger suicide rate than countries that achieve a large positive correlation. It might be interesting to test this hypothesis.
Let us create two subsets of the countries based on their correlations. The first group will contain countries that achieved correlations smaller than \(-0.6\), and the other group will contain countries whose suicide rate and GDP correlation was larger than \(0.6\).
Let \(\mu_l\) and \(\mu_h\) be the mean suicide rates of countries of the first and second group, respectively.
The null hypothesis is then \[ \mu_l \leq \mu_h \] and the alternative hypothesis is \[ \mu_l > \mu_h \]
Once again, we will make use of the Welch’s Two Sample t-test.
# Extract the suicide rates of the countries with significant correlations
signif.corr <- data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(continent, country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita),
.groups = "keep") %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
filter(p.value < 0.05)
t.test(x = signif.corr %>% filter(correlation < -0.6) %>% pull(suicides_per_100k),
y = signif.corr %>% filter(correlation > 0.6) %>% pull(suicides_per_100k),
alternative = "greater")
##
## Welch Two Sample t-test
##
## data: signif.corr %>% filter(correlation < -0.6) %>% pull(suicides_per_100k) and signif.corr %>% filter(correlation > 0.6) %>% pull(suicides_per_100k)
## t = 3.366, df = 24.331, p-value = 0.001265
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 4.067158 Inf
## sample estimates:
## mean of x mean of y
## 18.118587 9.851997
The resulting p-value is less than our predefined confidence level, therefore we reject the null hypothesis in favor of the alternative hypothesis.
Another hypothesis that I wanted to test is that countries with large absolute values of the suicide rate and GDP correlation have higher suicide rates than countries that have a small absolute correlation. However, I did not want to base this hypothesis on the data contained in this dataset, since the small correlations did not turn out to be statistically significant.
Having done the analysis of the dataset, it might also be interesting to construct a predictive model that would predict the suicide rate for a certain group of population.
For this purpose I will make use of LASSO, a regression method that performs an automatic feature selection. We will try to fit two different models and then evaluate them both using cross-validation and a test dataset.
First off, we will have to encode our ordinal and nominal variables, so that LASSO can make proper use of them.
# Compute the suicide rate
ml.data <- data %>%
mutate(suicide_rate = (suicides_no / population) * 100000)
selected.columns <- c("suicide_rate", "population", "age", "sex", "gdp_per_capita",
"gdp_for_year", "country", "continent", "year")
# Hot one encoding the nominal variables
dummies <- dummyVars( ~ sex + continent + country, data = ml.data)
ml.data.encoded.nominal <- predict(dummies, newdata = ml.data)
# Tidy up the column names so that no space characters are present
colnames(ml.data.encoded.nominal) <- make.names(colnames(ml.data.encoded.nominal), unique = TRUE)
# Append the encoded ordinal variables and put everything back together
ml.data.encoded <- as_tibble(cbind(ml.data[, c("population", "gdp_per_capita", "gdp_for_year", "year")],
age = as.numeric(ml.data$age),
generation = as.numeric(ml.data$generation),
ml.data.encoded.nominal))
The first model that we can try to fit using LASSO is a simple linear model.
set.seed(123)
formula = as.formula(" ~ .")
# Partition the dataset into the training and testing subsets
train.idx <- createDataPartition(ml.data$suicide_rate, p = 0.7, list = FALSE)
y.train <- ml.data[train.idx, ]$suicide_rate
y.test <- ml.data[-train.idx, ]$suicide_rate
# Encode the data into required model
X.train <- data.frame(ml.data.encoded[train.idx, ])
X.train <- sparse.model.matrix(formula, data = X.train)
X.test <- data.frame(ml.data.encoded[-train.idx, ])
X.test <- sparse.model.matrix(formula, data = X.test)
# Run cross-validated LASSO
lasso.base.model <- cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
lasso.base.model
##
## Call: cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.00626 77 177.4 6.664 108
## 1se 0.25851 37 183.5 7.325 74
# Performance on the test dataset
print("Test dataset performance")
## [1] "Test dataset performance"
prediction <- predict(lasso.base.model, s = lasso.base.model$lambda.1se, newx = X.test)
postResample(pred = prediction, y.test)
## RMSE Rsquared MAE
## 13.6377579 0.4932341 8.8856364
The most regularized fit has been achieved when lambda equals 0.2585122. The \(R^2\) of this model is fairly frequently around 0.45, which would mean that it explains 45% of the variance in our original data. I believe we can do better than this.
We have seen that the previous model did not fit the data very well. The model can be made more complex in order to provide it with more flexibility, which would hopefully allow it to perform better. We will extend the previous model with polynomial features.
set.seed(123)
# Create the model formula
# This formula includes all of the features as well as their squares and all of the
# interactions between different features up to second degree
formula <- as.formula(paste(" ~ .^2 + ",
paste("poly(",
colnames(ml.data.encoded),
", degree = 2, raw = TRUE)[, 2]",
collapse = " + ")))
# Partition the dataset into the training and testing subsets
train.idx <- createDataPartition(ml.data$suicide_rate, p = 0.7, list = FALSE)
y.train <- ml.data[train.idx, ]$suicide_rate
y.test <- ml.data[-train.idx, ]$suicide_rate
X.train <- data.frame(ml.data.encoded[train.idx, ])
X.train <- sparse.model.matrix(formula, data = X.train)
X.test <- data.frame(ml.data.encoded[-train.idx, ])
X.test <- sparse.model.matrix(formula, data = X.test)
# Run the cross-validated LASSO
lasso.poly.model <- cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
lasso.poly.model
##
## Call: cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.00105 100 76.57 3.546 857
## 1se 0.04350 60 80.02 3.937 360
cat("\nMost regularized fit attained for lambda = ")
##
## Most regularized fit attained for lambda =
cat(lasso.poly.model$lambda.1se)
## 0.04349984
We can see that the most regularized fit has been attained with lambda equal to 0.0434998.
print("Test dataset performance")
## [1] "Test dataset performance"
prediction <- predict(lasso.poly.model, s = lasso.poly.model$lambda.1se, newx = X.test)
postResample(pred = prediction, y.test)
## RMSE Rsquared MAE
## 8.9030650 0.7834786 5.2853417
Based on the performance of this model on the test dataset, we can say that it is indeed better at making predictions than our previous model was.
Let’s take a look at what features where assigned the largest coefficients.
coeffs.poly <- coef(lasso.poly.model, s = lasso.poly.model$lambda.1se)
head(coeffs.poly[order(abs(coeffs.poly), decreasing = TRUE), ], n = 20)
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## (Intercept) sex.Male:country.Sri.Lanka
## 142.252521 29.491684
## sex.Male:country.Kazakhstan sex.Male:country.Russian.Federation
## 24.680994 24.644060
## sex.Male:country.Lithuania sex.Male:country.Hungary
## 22.231592 19.812583
## sex.Male:country.Estonia sex.Male:country.Latvia
## 18.745390 17.025602
## sex.Male:country.Slovenia sex.Male:country.Belarus
## 16.578098 14.922635
## sex.Female:country.Lithuania sex.Male:country.Finland
## -14.598312 12.424431
## sex.Male:country.Austria sex.Male:country.Uruguay
## 12.116756 11.531081
## sex.Female:country.Republic.of.Korea sex.Male:country.Ukraine
## -11.436428 10.932478
## sex.Female:country.Belarus sex.Male:country.Cuba
## -10.861750 9.963613
## sex.Male:country.South.Africa sex.Male:country.Greece
## -9.795208 -9.574332
It seems that the model decided to base its prediction mainly on the interactions between the categorical features sex and country. I am surprised that large coefficients were not assigned to features containing the \(\texttt{gdp_per_capita}\) features, as we have seen that it is often correlated with the suicide rate.
After exploring this dataset we can say that we have found several suicide patterns. It came to our attention that men are significantly more likely to commit suicide than women, which we have also tested using a t-test. This was also found to be true by the WHO suicide report. We also revealed that there is a difference in suicide rates between different age categories, and that mainly in Europe and Asia, the suicide risk grows larger with the age of a person. Last but not least, suicide rates within many countries were strongly correlated with the GDP per capita of the country.
We found that Europe has the highest overall suicide rate out of all of the continents, with Russia and Lithuania having the highest suicide rates.
While working with this dataset, I wondered what other patterns or connections we might be able to find, given that we would expand our dataset with new data. For example, It would be interesting to check whether there is a relationship of a suicide rate within a country and the country’s happiness index. I would also be interested in the connection between a country’s suicide rate and the alcohol consumption level within its population.